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Abstract 



1 Introduction 



(N 

O 

(N 

O 
D 

Q 
in 



a 

r-H 

> 

o 

cn 

m 
(N 

(N 



X 



Many biological characteristics of evolutionary inter- 
est are not scalar variables but continuous functions. 
Given a dataset of function-valued traits generated 
by evolution, we develop a practical statistical ap- 
proach to infer ancestral function-valued traits, and 
estimate the generative evolutionary process. We do 
this by combining dimension reduction and phyloge- 
netic Gaussian process regression, a nonparametric 
procedure which explicitly accounts for known phy- 
logenetic relationships. We test the methods' perfor- 
mance on simulated function-valued data generated 
from a stochastic evolutionary model. The methods 
are applied assuming that only the phylogeny and the 
function-valued traits of taxa at its tips are known. 
Our method is robust and applicable to a wide range 
of function- valued data, and also offers a phylogeneti- 
cally aware method for estimating the autocorrelation 
of function-valued traits. 

Keywords: comparative analysis; Ornstein- 
Uhlenbcck process; non-parametric Bayesian infer- 
ence; functional phylogenetics; ancestral reconstruc- 
tion; functional Gaussian process regression 
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The number, reliability and coverage of evolutionary 
trees are growing rapidly [U [2]. However, knowing 
organisms' evolutionary relationships through phylo- 
genetics is only one step in understanding the evo- 
lution of their characteristics [3j. Three issues are 
particularly challenging. The first is limited informa- 
tion: empirical information is typically only available 
for extant taxa, represented by tips of a phylogenetic 
tree, whereas evolutionary questions frequently con- 
cern unobserved ancestors deeper in the tree. The 
second is dependence: the available information for 
different organisms in a phylogeny is not indepen- 
dent since a phylogeny describes a complex pattern 
of non-independence; observed variation is a mixture 
of this inherited variation and specific variation [1]. 
The third is high dimensionality: the emerging litera- 
ture on function- valued traits O El [Zj recognises that 
many characteristics of living organisms are best rep- 
resented as a continuous function rather than a single 
factor or a small number of correlated factors. Such 
characteristics include growth or mortality curves [H] , 
reaction-norms [() and distributions [10', where the 
increasing ease of genome sequencing has greatly ex- 
panded the range of species in which distributions 
of gene jlli or predicted protein [12 properties are 
available. Therefore, a function-valued trait is de- 
fined as a phenotypic trait that can be represented 
by a continuous mathematical function S]. 

Previous work [13 proposed an evolutionary model 
for function-valued data d related by a phylogeny T. 
The data are regarded as observations of a phylo- 
genetic Gaussian Process (PGP) at the tips of T. 
That work shows that a PGP can be expressed as a 
stochastic linear operator X on a fixed set cj) of basis 
functions (independent components of variation), so 
that 

d = X^^ (1) 

However, the study does not address the linear in- 
verse problem of obtaining estimates (f> and X oi (j) 
and X: our first contribution in this paper is to pro- 



vide an approach to this problem in section 2.2 via 
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independent principal components analysis (IPCA 

M)- 

We refer to X as the mixing matrix, and to the 
(i, j)th entry of X as the mixing coefficient of the zth 
basis function at the jth taxon. It is these mixing 
coefficients that we model as evolving. For each fixed 
value of i, the are correlated (due to phylogeny) 
as j varies over the taxa; the basis functions them- 
selves do not evolve in our model. 



In section 2.3 we address the problem of estimating 
the statistical structure of the mixing coefficients by 
performing phylogenetic Gaussian process regression 
(PGPR) on each of the rows of X separately. This 
corresponds to assuming independence between the 
rows (i.e. that the coefficients of the different basis 
functions evolve independently). It is commonly ar- 
gued in the quantitative genetics literature |15j that 
evolutionary processes can be modelled as Ornstein- 
Uhlenbeck (OU) processes. Under these assumptions 
the estimation of the forward operator reduces to the 
estimation of a small vector 7 of parameters [T5] . 



In section 2.1 we clarify the interpretation of these 
parameters in evolutionary contexts. The explicit 
PGPR posterior likelihood function is then used to 
obtain maximum likelihood (MLE) estimates for 7. 
The estimation of 7 is known to be a challenging sta- 
tistical problem [16]. We suggest an approach based 



on the principal of bagging [17] in section 2.4 



Our final contribution (section 2.5) addresses the 



problem of estimating the function-valued traits of 
ancestral taxa. The PGPR step above also returns 
a posterior distribution for the mixing coefficient of 
each basis function at each ancestral taxon in the phy- 
logeny. At any particular ancestor the estimated ba- 
sis functions may be combined statistically, using the 
posterior distributions of their respective mixing coef- 
ficients, to provide a function- valued posterior distri- 
bution. Since the univariate posterior distributions 
are Gaussian, and the mixing is linear, the poste- 
rior for the function-valued trait has a closed form 
representation as a Gaussian process (Eq. [?]) which 
provides a major analytical and computational ad- 
vantage for the approach. We can verify the meth- 
ods proposed by using a PGP as a stochastic genera- 
tive model. This simulates correlated function- valued 
traits across the taxa of T. Given only the phylogeny 
and the function-valued traits of taxa at its tips, our 
estimates for and the ancestral functions are then 
compared to the simulation. 

Overall, our three methods (in 2.2 2.4 2.51 ap- 



analysis with the evolutionary dynamics of quan- 
titative phenotypic traits, allowing nonparametric 
Bayesian inference from phylogenetically correlated 
function-valued traits. An outline of the framework 
presented in the current work can be found in Fig. [T] 
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Figure 1: The three methods presented in this paper 
(ovals) and their interrelationships. 



2 Methods &; Implementation 

2.1 Artificial evolution of function- 
valued traits 

We begin by generating a random phylogenetic tree 
T with 128 tips, shown in Fig. [2] This fixes the 
experimental design for our simulation and inference, 
but further simulations given in the Supplementary 
Material confirm that the statistical performance of 
our methods is consistent across a range of choices 
for bfT. Branch length distributions are surprisingly 
consistent across organisms [TH] ; branch lengths were 
drawn from the empirical branch length distributioij^ 
extracted from TreeFam 8.0 f2]. 

Secondly we chose a basis in Eq. [T] We have 
no reason a priori to suppose that this basis is or- 



propriately combine developments in functional data ^Sce Supplementary Material section 1 



2 




Figure 2: The random phylogenetic tree used & ex- 
amples of the function- valued traits shown at the tips 
(extant taxa) and the internal nodes (ancestral taxa). 
A subset of these is used in Fig. [5j 
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thogonal and, in general, there is no reason for our 
inference procedure to be sensitive to the particular 
shape of the basis functions. The three simple non- 
orthogonal, unimodal functions shown in Fig. [3] were 
therefore chosen as examples. For computational pur- 
poses each basis function was stored numerically as a 
vector of length 1024, so that the basis matrix (j) was 
of size 3 X 1024 and its ith row stored the ith basis 
function. 

Thirdly, different mixing coefficients were gener- 
ated by a phylogenetic OU process for each basis 
function and stored in the respective row of X. Our 
modelling assumption is that the mixing coefficients 
for distinct basis functions (/>i,(/)2,03 are statistically 
independent of each other: in Eq. [Tjthis means that 
the rows of X are independent. It is therefore suf- 
ficient to describe the stochastic process generating 
Xi, the zth row of X with i S {1, 2, 3}. We calculated 
the mixing matrix at the 128 tip taxa so X is of size 
3 X 128. The "true" ancestral values were established 
by generating phylogenetic Ornstein-Uhlenbeck (OU) 
processes over the whole phylogeny. The values of 
this process at tip taxa were stored in a row vector 
Xi (Xi is a simulation of the tip taxa mixing coef- 
ficients Xi excluding the non-phylogenetic variation) 
and its values at internal taxa were stored in a row 
vector Wi for performance analyses in section |2.5| 
To simulate the additional effect of non-phylogenetic 
variation (due, for example, to measurement error 
or environmental effects), independent (that is, non- 



Figure 3: From Top Left: Original Basis signals, 0; 
Mixed Sample at the tips, d (four individual function- 
valued traits are shown; red line and grey band show 
respectively the mean and two standard deviations 
for all 128 function- valued data at the tips); IPCA 
Basis, (j); PCA Basis. 

phylogenetic) variation was added to each entry of 
Xf. _ 

Xi = Xi + Ei 

where is a 1 x 128 vector of independent Gaus- 
sian errors with mean and variance cr^ and finally 
the matrix multiplication in Eq. [ijwas performed to 
obtain the simulated data d. The 'extant' function- 
valued trait at tip taxon j is thus Y^^i=iXij(t)i (a 
vector of length 1024), while the ancestral function- 
valued trait at internal taxon g is X]^=i ^ig4>i- The 
ancestral function- valued traits therefore exhibit only 
the phylogenetic part of simulated variation, while 
the extant function-valued traits exhibit both phylo- 
genetic and non-phylogenetic variation. Of course, it 
is not possible to reconstruct non-phylogenetic varia- 
tion using phylogenetic methods: we simulate non- 
phylogenetic variation only to demonstrate that it 
does not prevent the reconstruction of the phyloge- 
netic part of variation for ancestral taxa in sections 
[O to [231 

We now comment on the specific parameters cho- 
sen for the phylogenetic OU processes above. As in 
[T^ we refer to the strength of selection parameter a 
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and the random genetic drift a: we add superscripts 
to these parameters to distinguish between the three 
different OU processes. With this notation, the mix- 
ing coefficients for the row Xi have the following co- 
variance function : 

K^{t,,t2) ^E[X,^X,g] (2) 
=(a})2exp {-2aWT{t„tg)) + 



where = y -"j^ , -Dt (tj,tg) denotes the phylo- 
genetic or patristic distance (that is, the distance in 
T) between the jth and gth tip taxa, cr„ is defined as 
above, and 



, 1 iff tj ~ tg and tj is a tip taxon, 
tj.*3 1 otherwise 



adds non-phylogenetic variation to extant taxa as dis- 
cussed above, ie. 6^ evaluates to 1 only for extant 
taxa, thus (t„ quantifies within-species genetic or en- 
vironmental effects and measurement error in the ith 
mixing coefficient. We see from Eq. |2]that the pro- 
portion of variation in the row Xi attributable to the 



phylogeny is )2+(,^)2 • 

In the Gaussian process regression literature in Ma- 
chine Learning, ^ is equivalent to £, the characteris- 
tic length-scale of decay in the correlation func- 
tion and in the following we work with the latter. 
For all of the OU processes we used characteristic 
length scales relative to 8.22, the maximum patristic 
distance (imax) between two extant taxa for our sim- 
ulated tree (Fig. [2]). The values we used are given 
in Table [Tj In particular, — when i = 2 and it 
follows that the characteristic length scale £ plays no 
role for this OU process, and equally we do not define 
the strength-of-selection parameter a* when i = 2. 



1 2.5 6.17 .5 

2 NA 1 

3 1.5 2.06 .5 



Table 1: The fixed values used for the parameters in 
Eq. [2] to generate the mixing coefficients Xij. Each 
row constitutes a value of 7*. 6.17 & 2.06 correspond 
to .75 and .25 of the tree's tmax respectively. When 
i = 2, t \s not applicable since there is no phyloge- 
netic variation in the sample. 



2.2 Dimensionality reduction and 
source separation for function- 
valued traits 



Given a dataset d of function-valued traits, we would 
like to find appropriate estimates X and (f) of the mix- 
ing matrix X and the basis set (j) respectively. The 
first task is to identify a good linear subspace S of 
the space of all continuous functions by choosing ba- 
sis functions appropriately. The purpose is to work, 
not with the function-valued data directly, but with 
their projections in S. We may say that the chosen 
subspace S is good if the projected data approximate 
the original data well while the number of basis func- 
tions is not unnecessarily large, so that S has the 
'effective' dimension of the data. 



We then face a linear inverse problem: given the 
dataset d of function-valued traits, the task is to gen- 
erate estimates X and (j) (Eq. l|. This task is also 
known as source separation \2\\ . which has a vari- 
ety of implementations making different assumptions 
about the basis (f) and mixing coefficients X. One 
widely used approach is PC A [52], which returns or- 
thogonal sets of basis functions to explain the great- 
est possible variation. PGA has been extended to 
take account of phylogenetic relationships [23] , how- 
ever, if a sample of functions is generated by mixing 
non-orthogonal basis functions, the principal compo- 
nents of the sample (whether or not they account 
for phylogeny) will not equal the basis curves, due 
to the assumption of orthogonality: see Fig. [3] In 
independent components analysis (ICA), the alter- 
native assumption is made that the rows Xi of X are 
statistically independent. This assumption fits more 
naturally with our modelling assumptions, since we 
assume that the rows Xi are mutually independent 
[2T] . IGA has proved fruitful in other biological ap- 
plications [23] as has passing the results of PGA to 
IGA, which has been termed IPGA [Tij . 

PGA is an appropriate tool for identifying the ef- 
fective dimension of a high-dimensional dataset [25) . 
So, to achieve both dimension reduction and source 
separation, we first applied PGA to the dataset d (the 
128 function- valued traits at the tips of T) to deter- 
mine the appropriate number of basis functions. The 
principal components were then passed to the Cu- 
blCA implementation of IGA [26^ . GublGA returned 
a new set of basis functions (Fig. [sj lower-right panel) 
which were taken as the estimated basis 6. 
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2.3 Phylogenetic Gaussian process re- 
gression 

ICA also returns the estimated mixing coefficients at 
tip taxa, X. Our next step was to perform PGPR 
[IB] separately on each row Xi, assuming knowledge 
of the phylogeny T, in order to obtain posterior dis- 
tributions for all mixing coefficients throughout the 
tree T. 

Gaussian process regression (GPR) [20] is a flexible 
Bayesian technique in which prior distributions are 
placed on continuous functions. Its range of priors in- 
cludes the Brownian motion and Ornstein-Uhlenbeck 
(OU) processes, which are by far the most commonly 
used models of character evolution [271 IE] ■ Its im- 
plementation is particularly straightforward since the 
posterior distributions are also Gaussian processes 
and have closed forms. We now give a brief exposi- 
tion of GPR, using notation standard in the Machine 
Learning literature (see, for example, [20]). 

A Gaussian process may be specified by its mean 
surface and its covariance function K{'-f), where 7 is 
a vector of parameters. Since the components of 7 
parameterise the prior distribution, they are referred 
to as /ij/perparameters. The Gaussian process prior 
distribution is denoted 

/^AA(0,X(7)) 

If is a set of unobserved coordinates and x is a 
set of observed coordinates, the posterior distribution 
of the vector f{x*) given the observations f{x) is 

fix*)\fix)-^UiA,B) (3) 

where 

A=K{x*,x,j)K{x,xn)-'f{x), (4) 
B =K{x*,x*,-/) 

—K{x*,x, ^)K{x, x, ^)~^K{x* ,x, 7)"^" (5) 

and K{x*,x,'-f) denotes the \x*\ x |x| matrix of the 
covariance function K evaluated at all pairs x* e 
X*,Xj € X. Equations [4] and [5 convey that the pos- 
terior mean estimate will be a linear combination of 
the given data and that the posterior variance will 
be equal to the prior variance minus the amount that 
can be explained by the data. Additionally, the log- 
likelihood of the sample f{x) is 



logp(/(x)|7) 



---^f{xfK{x,x,jr'f{x) 
-^log(det(if(a;,a;,7)))- Y 



log27r. 



(6) 



It can be seen from Eq. [6] that the maximum like- 
lihood estimate is subject both to the fit it deliv- 
ers (the first term) and the model complexity (the 
second term). Thus, Gaussian process regression is 
non-parametric in the sense that no assumption is 
made about the structure of the model: the more 
data gathered, the longer the vector f{x), and the 
more intricate the posterior model for f{x*). 

PGPR extends the applicability of GPR to evolved 
function-valued traits. A phylogenetic Gaussian pro- 
cess is a Gaussian process indexed by a phylogeny T, 
where the function-valued traits at each pair of taxa 
are conditionally independent given the function- 
valued traits of their common ancestors. When the 
evolutionary process has the same covariance func- 
tion along any branch of T beginning at its root 
(called the marginal covariance function), these as- 
sumptions are sufficient to uniquely specify the co- 
variance function of the PGP, Kt- As we assume 
that T is known in our inverse problem, the only re- 
maining modelling choice is therefore the marginal 
covariance function. As can be seen from Eq. [2j K 
is a function of patristic distances on the tree rather 
than Euclidean distances as standard in spatial GPR. 

In comparative studies, where one has observations 
at the tips of T, the covariance function Kt may be 
used to construct a Gaussian process prior for the 
function-valued traits, allowing functional regression. 
In the model that we use this is equivalent to spec- 
ifying a Gaussian prior distribution for the mixing 
coefficients Yij and Xij. This may be done by re- 
garding the row vectors Yi and Xi as observations of 
a univariate PGP. As noted in [T3] , if we assume that 
the evolutionary process is Markovian and stationary 
then the modelling choice vanishes and the marginal 
covariance function is specified uniquely: it is the sta- 
tionary OU covariance function. If we also add ex- 
plicit modelling of non-phylogenetically related vari- 
ation at the tip taxa, the univariate prior covariance 
function has the unique functional form presented in 
Eq. [2j We do not assume knowledge of the parame- 
ters of Eq. [2] however: their estimation is the subject 
of the next section. 

2.4 Hyperparameter estimation 

Since the posterior distributions returned by PGPR 
depend on the hyperparameter vector 7, we must es- 
timate 7 in order to reconstruct ancestral function- 
valued traits, and the estimation procedure should 
correct for the dependence due to phylogeny. Max- 
imum likelihood estimation (MLE) of the phylo- 
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genetic variation, non-phylogenetic variation and 
characteristic-length-scale hyperparameters crj-, cr^ 
and ^* respectively may be attempted numerically us- 
ing the explicit prior likelihood function (Eq. [6]). Be- 
cause estimating and £^ alone is challenging [T5] 
(although the estimation improves significantly with 
increased sample size) , and we have further increased 
the challenge by introducing non-phylogenetic varia- 
tion, we propose an improved estimation procedure 
using the machine learning technique bagging |17j . 
which a member of the boosting framework f 22| . We 
show that these estimates may be further improved 



if one knows the value of the ratio 



Ml! 



which is 



closely related to Pagel's A [25] , 

Bagging (bootstrap aggregating) seeks to reduce 
the variance of an estimator by generating multiple 
estimates and averaging. It is simple to implement 
given an existing estimation procedure: one adds a 
loop front end that selects a bootstrap sample and 
sends it to the estimation procedure and a back end 
that aggregates the resulting estimates [T7]. We gen- 
erated 100 (sub)trees of 100 taxa by sampling without 
replacement our original 128 taxa tree, obtained the 
MLE for 7 on each subtree, and averaged these esti- 
mates to obtain the aggregated estimate 7. Our re- 
sults are shown in Table [2j for i = 1 and i — 3, given 
our moderate sample size (128 taxa), the accuracy of 
these results is at least in line with the state of the 
art [IS] despite the additional challenge posed by non- 
phylogenetic variation. For i = 2, where phylogenetic 
variation is absent from the generative model (ay=0), 
our estimation procedure indicates its absence hy re- 
turning estimates for whose magnitude is unreal- 
istically small for the examined tree (less than the 
1st percentile of the tree's patristic distances). Com- 
menting further on this matter, exceptionally small 
characteristic length-scales relative to the tree patris- 
tic distances, as seen here, practically suggest taxa- 
specific phylogenetic variation, ie. non-phylogenetic 
variation. This holds also in its reverse: exception- 
ally large characteristic length-scales suggest a stable, 
non-decaying variation across the examined taxa that 
is indifferent to their patristic distances, again sug- 
gesting the absence of phylogenetic variance among 
the nodes. 

To assess the robustness of this hyperparameter es- 
timation method we performed 1024 simulations, ran- 
domly regenerating the tree and parameter vector 7 
each time[^ The accuracy of these estimates is shown 



Fig. |4j Improved results when the ratio ^ 



IS 

known a priori (for example, through knowledge of 
Pagel's A) are also given in the supplementary mate- 
rial (Sect. 2 & 3). Our ultimate aim is ancestor re- 
construction rather than hyperparameter estimation 
per se, and this is the subject of the next section. 



a 



1 3.41 (.62) 

2 0.55 (.33) 

3 2.83 (.33) 



2.83 (.47) 0.78 (.47) 
0.05 (.02) 0.84 (.34) 
2.06 (.50) 0.73 (.29) 



Table 2: The bagging estimates for the hyperparam- 
eters in Eq. [2] (standard deviations of bagging esti- 
mates in parentheses). Each row corresponds to a 
given estimate of the vector 7*. These estimates pro- 
vide the maximum likelihood value for Eq. [6] and are 
comparable with the original ones from Table [T] 



2.5 Ancestor reconstruction 



Having generated function- valued dat a (S ect. 2.11, 



extracted mixing coefficients X (Sect. 2.2) and per- 
formed hyperparameter estima tion (Sect. 2.4 1, we 



may now perform PGPR (Sect. 2.3) on each row X^, 



to obtain the univariate Gaussian posterior distribu- 
tion for the mixing coefficient Wif at any internal 
taxont*. As discussed in Sect. 2.3 the Gaussian pro- 



cess prior distribution has covariance function (Eq. 
[2]). We have asses sed the accuracy of our bagging es- 
timate 7 in Sect. 2.4 and we now substitute 7' into 



Eq. [2j Taking a simple a nd d irect approach, our esti- 
mate (j) obtained in Sect. 2.2 may then be substituted 



See Supplementary Material section 2 



into Eq. [Tjto obtain the function- valued posterior dis- 
tribution /f for the function- valued trait at taxon t*. 
Since our estimated basis functions are stored numer- 
ically as vectors of length 1024, this gives the same 
discretision for the ancestral traits. 

Conditioning on our estimated mixing coefScients 
Xi for the tip taxa, the posterior distribution of Wit* 
is 

where the vector Ai and matrix Bi are obtained from 
Eq.'s [4] and (s) taking x = Xj, x* = Wif and 7* 
respectively for our observation coordinates, estima- 
tion coordinates and hyperparameter vector. Since 
our prior assumption is that the rows of X are statis- 
tically independent of each other, it follows from Eq. 
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Relative Error 



Relative Error 



Relative Error 



Figure 4: Kernel Density estimates of the relative errors in 1024 runs of the 7 estimation procedure, each time 
for a different tree, a different set of mixing coefficients and a different set of parameters in 7; no components 
of 7 are assumed to be known beforehand. Estimation results are commented on in the Discussion. The 
median values shown by the dotted line are (-0.073, -0.131 and 0.001) respectively. 



[Hthat 



■ ^■=1^1 B.. 



(7) 



The marginal distributions of this representation 
(mean and standard deviation) are shown in Fig. [5| 
Fig. [s] compares the function- valued estimates ft* 
to the simulated function-valued traits at the root 
(left panel), an internal node (centre), and at a tip 
(right panel). In the centre and left panels the sim- 
ulated function-valued data is shown in black, and 
can be seen typically to lie within two posterior stan- 
dard deviations. In the right panel, the black line 
is the observed function-valued trait at that tip: the 
red line and dark grey band represent the posterior 
distribution of its phylogenetic component, and the 
light grey band represents the estimated magnitude 
of the additional non-phylogenetic variation. Uncer- 
tainty over the phylogenetic part of variation (dark 
grey band) decreases from root to tip, as all obser- 
vations are at the extant tip taxa. We note that the 
posterior distributions, even at the root, put clear 
statistical constraints on the phylogenetic part of an- 
cestral function-valued data: in this (admittedly sim- 
ulated and highly controlled) setting we can reason 
effectively about ancestral function-valued traits. 

3 Discussion 

In Sec. |2.1| we have appealed to Eq. [TJin the setting 
of mathematical inverse problems where, given data 



d, the challenge is to infer a forward operator G and 
model 6 such that: 



G(0) 



(8) 



and such problems are typically under-determined 
and require additional modelling assumptions j29j. 
Given a phylogeny T and function-valued data d at 
its tips, we wish to infer the forward operator Gt and 
model 6 such that 



d = Gt(0) 



(9) 



When the data d are a small number of corre- 
lated factors per tip taxon, a variety of statistical ap- 
proaches are available (e.g. see [30], [SI])- When the 
data are functions, the phylogenetic Gaussian pro- 
cesses (PGP) [m [32] has been proposed as the for- 
ward operator and this is the approach we have taken 
in this work. 

Our dimensionality reduction methodology in Sect. 
2.2 can be easily varied or extended. For example, 
any suitable implementation of PCA may be used 
to perform the initial dimension reduction step: in 
particular, if the data have an irregular design (as 
happens frequently with function- valued data), the 
method of Yao et al. may be applied to account 
for this; the ICA step then proceeds unchanged. We 
also note that while we find the CubICA implementa- 
tion of ICA to be the most successful in our signal sep- 
aration task, other implementations like FastICA [3T] 
or JADE [31] can also be employed. In general, ICA 
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Root Estimate 



Node Estimate 



Tip Estimate 




Figure 5: Posterior distributions at three points in the phylogeny using the estimated (j) and 7. The prediction 
made by the regression analysis is shown via the posterior mean (red hne), the component of posterior 
variance due to phylogenetic variation (two standard deviations, dark grey band) and non-phylogenetic 
variation (two standard deviation, hght grey band). The black line shows the simulated data enabling visual 
validation of the ancestral predictions. In the right panel, the black line is the training data at a tip taxon 
the red line and dark grey band represent the posterior distribution of its phylogenetic component, while the 
light grey band represents the estimated magnitude of non-phylogenetic variation. The root and internal 
taxon here are the same as those indicated in Figj2] & [3] and the tip is the second from bottom on the same 
figure. 



gives rows Xi of the estimated mixing matrix that 
are maximally independent under a particular mea- 
sure of independence involving, for example, higher 
sample moments or mutual information, in order to 
approximate the solution of the inverse problem in 
Eq. [TJunder our assumption of independence between 
the rows of X. PCA and ICA have different pur- 
poses (respectively, orthogonal decomposition of vari- 
ation and separation of independently mixed signals) 
and we employ them sequentially in IPCA. IPCA is 
nonparametric and, in particular, both distribution- 
ally and phylogenetically agnostic. This means that 
unlike PCA, IPCA is robust to non-Gaussianity in 
the data and, unlike phylogenetically corrected PCA, 
IPCA is robust to mis-specification of the phylogeny 
and to mixed phylogenetic and non-phylogenetic vari- 
ation in the data: any of these can be features of 
biological data. 

It can be seen in Fig. |4] that the estimation of £ 
is more challenging than the estimation of (T„ or ct/, 
having greater bias and variance. This corresponds 
to the documented difficulty of estimating the param- 
eter a in the Ornstein-Uhlenbeck model, particularly 
for smaller sample sizes. Our work on hyperparam- 
eter estimation in Sec. |2.4| mitigates these difficul- 
ties due to small sample size [55] by employing 



bagging in order to bootstrap our sample. Some- 
what unintuively, bagging "works" exactly because 
the subsample 7 estimates are variable and thus we 
avoid overfitted final estimate^ Conceptually our 
work on hyperparameter estimation, when taken to- 
gether with Sec. |2.2[ relates to the character process 
models of |36j and orthogonal polynomial methods of 
[37], which give estimates for the autocovariance of 
function- valued traits. Writing out Eq. [ijfor a single 
function- valued trait (at the jth tip taxon, say), our 
model may be viewed as 

3 3 

/(^) = ^9'>.34>t{x) + ^e,j(l)i{x) (10) 

1=1 i=l 

where the mixing coefficient Xij has been expressed 
as the sum of gtj, the genetic (i.e. phylogenetic) part 
of variation, plus e^ , the non-phylogenetic (eg. en- 
vironmental) part of variation, just as in these ref- 
erences. Then the autocorrelation of the function- 
valued trait is 

E[f{x,)f{x^)] ^ (KO' + (^D') 0.(X1)0.(X2) 

i=l 

(11) 

^See Supplementary Material Section 2 
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The estimates of cr/ and cr" obtained in section 2.4 



may be substituted into Eq. 11 to obtain an estimate 
of the autocovariance of the function- valued traits un- 
der study. This estimate has the attractions both of 
being positive definite (by construction) and of taking 
phylogeny into account. 

Various frameworks exist which could be used to 
generalise the method presented in Sec. 2.4 to 



model heterogeneity of evolutionary rates along the 
branches of a phylogeny [38] or for multiple fixed [15] 
or randomly evolving |39J |16| local optima of the mix- 
ing coefficients. For the stationary OU process the 
optimum trait value appears only in the mean, and 
not in the covariance function, and so does not play 
a role as a parameter in GPR (see [20]). We have 
not implemented such extensions here, effectively as- 
suming that a single fixed optimum is adequate for 
each mixing coefficient. Nonetheless our framework is 
readily extensible to include such effects, either im- 
plicitly through branch-length transformations [40] . 
or explicitly by replacing the OU model with the more 
general Hansen model [55] . 

R Code for the IPCA, ancestral reconstruction and 
hyperparameter estimation is available from 
https : //github . com/ f pgpr/| 
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